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In this work we describe cui efficient implementation of a hierarchy of algorithms for Gaussian 
elimination upon dense matrices over the field with two elements (F2). We discuss both well- 
known and new algorithms as well as our implementations in the M4RI library, which has been 
adopted into SAGE. The focus of our discussion is a block iterative algorithm for PLE decom- 
position which is inspired by the M4R1 algorithm. The implementation presented in this work 
provides considerable performance gains in practice when compared to the previously fastest im- 
plementation. We provide performance figures on x86_64 CPUs to demonstrate the alacrity of our 
approach. 
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1. INTRODUCTION 

Wc describe an efficient implementation of a hierarchy of algorithms for Gaussian 
elimination of dense matrices over the field with two elements (F2). In particu- 
lar, we describe algorithms for PLE decomposition [Jeannerod et al. 2011] which 
decompose a matrix A into P ■ L ■ E where P is a permutation matrix, L is a 
unit lower triangular matrix and E a matrix in row echelon form. Since matrix 
decomposition is an essential building block for solving dense systems of linear and 
non-linear equations [Lazard 1983] much research has been devoted to improving 
the asymptotic complexity of such algorithms. A line of research which produced 
decompositions such as PLUQ [Golub and Van Loan 1996], LQUP [Ibarra et al. 
1982], LSP [Ibarra et al. 1982] and PLE [Jeannerod et al. 2011], among others. 
Each of those decompositions can be reduced to matrix-matrix multiplication and 
hence can be achieved in 0{n") time where co is the complexity exponent of linear 
algebra^. However, in [Jeannerod et al. 2011] it was shown that many of these 



^For practical purposes we set oj 
persmith and Winograd 1990]. 



= 2.807. Lower exponents have been obtained in theory [Cop- 
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decompositions are essentially equivalent (one can be produced from the other in 
negligible cost) and that for almost all applications the PLE decomposition is at 
least as efficient in terms of time and memory as any of the other. Hence, in 
this work, we focus on the PLE decomposition but consider the special case of F2. 
In particular, we propose a new algorithm for block-iterative PLE decomposition 
and discuss its implementation. We also describe our implementation of previously 
known algorithms in the M4RI library [Albrecht and Bard 2011]. 

This article, is organised as follows. We will first highlight some computer ar- 
chitecture issues and discuss in place methods, as well as define notation in the 
remainder of this section. We will start by giving the definitions of reduced row 
echelon forms (RREF) and PLE decomposition in Section 2. We will then discuss 
the three algorithms and their implementation issues for performing PLE decom- 
position in Section 3, and other implementation issues in Section 4. We conclude 
by giving empirical evidence of the viability of our approach in Section 5. In Ap- 
pendix A we will describe the original M4RI algorithm (for reducing F2 matrices 
into either row echelon form or RREF), which initiated this line of research. 

1.1 Computer Arcfiitecture Issues 

The M4RI library implements dense linear algebra over F2. Our implementation 
focuses on 64-bit x86 architectures (x86_64), specifically the Intel Core 2 and the 
AMD Opteron. Thus, wc assume in this work that native CPU words have 64 bits. 
However, it should be noted that our code also runs on 32-bit CPUs and on non-x86 
CPUs such as the PowerPC. 

Element-wise operations over F2, being mathematically trivial, arc relatively 
cheap compared to memory access. In fact, in this work we demonstrate that 
the two fastest implementations for dense matrix decomposition over F2 (the one 
presented in this work and the one found in Magma [Bosma et al. 1997] due to Al- 
lan Steel) perform worse for moderately sparse matrices despite the fact that fewer 
field operations are performed. Thus our work provides further evidence that more 
refined models for estimating the expected efficiency of linear algebra algorithms 
on modern CPUs and their memory architectures are needed. 

An example of the CPU-specific issues would be the "SSE2 instruction set," which 
we will have cause to mention in Section 4. This instruction set has a command to 
XOR (exclusive-OR) two 128-bit vectors, with a single instruction. Thus if we are 
adding a pair of rows of 128n bits, we can do that with n instructions. This is a 
tremendous improvement on the 128n floating-point operations required to do that 
with real-numbered problems. This also means that any algorithm which touches 
individual bits instead of blocks of 128 bits will run about two orders of magnitude 
slower than an algorithm which avoids this. 

1.2 Notations and space efficient representations 

By j] we denote the entry of the matrix A at row i and column j. By ii we 
denote the ith element of the vector i. We represent mxm permutation matrices as 
integer vectors of length m, i.e., in LAPACK-style. That is to say, the permutation 
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matrix 

"10 0" 

10 
10 
1 
1 

is stored as P = [0,2,2,4,4], where for each index i the entry Pi encodes which 
swap ought to be performed on the input. This representation allows to apply 
permutations in-place. 

All indices start counting at zero, i.e. A[2, 2] refers to the intersection of the third 
row and third column. All algorithms, lemmas, theorems, propositions in this work 
are presented for the special case of F2. Hence, any statement that some algorithm 
produces a certain output is always understood to apply to the special case of F2, 
unless stated otherwise. 

Matrix triangular decompositions allow that the lower triangular and the upper 
triangular matrices can be stored one above the other in the same amount of mem- 
ory as for the input matrix: their non trivial coefficients occupy disjoint areas, as 
the diagonal of one of them has to be unit and can therefore be omitted. For two 
m X n matrices L and U such that L = [(ij] is unit lower triangular and U = [uij] is 
upper triangular, we shall use the notation L\U, following [Jeannerod et al. 2011] 
to express the fact that L and U are stored one next to the other within the same 
m X n matrix. Thus, A = L\U is the m x n matrix A = [ciij] such that a, j- = £ij 
if i > j, and aij = Uij otherwise. We call it a space sharing representation of L 
and U. Furthermore, A = ^0 | ^1 means that A is obtained by augmenting Ao by 
Al. 

The methods of this paper not only produce space-sharing triangular decom- 
positions, but also overwrite their input matrix with the output triangular de- 
composition. However, they are not in-place, contrarily to the general algorithms 
in [Jeannerod et al. 2011], as they use tabulated multiplications that require non 
constant extra memory allocations. 

Last but not least, complexity is always reported for square matrices, but the 
complexities for over-defined and under-defined matrices can be derived with mod- 
erate effort. 

2. MATRIX DECOMPOSITIONS 

In this work we are interested in matrix decompositions that do not destroy the 
column profile, since those play a crucial role in linear algebra for Grobner basis 
computations [Faugere 1999]^. 

Definition 1 Column rank profile. The column rank profile of a matrix A 
is the lexicographically smallest sequence of column indices ia < ■ ■ • < ir , with r 
the rank of A such that the corresponding columns are linearly independent. 

Definition 2 Leading Entry. The leading entry in a non-zero row of a ma- 
trix is the leftmost non-zero entry in that row. 

^These are also useful in simplified variants of the F4 algorithm for polynomial system solving 
such as the popular [Courtois et al. 2000]. 
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Perhaps, the most weU-known of such decompositions is the row cchleon form. 

Definition 3 Row Echelon Form. An m x n matrix A is in row echelon 
form if any all-zero rows are grouped together at the bottom of the matrix, and if 
the leading entry of each non-zero row is located to the right of the leading entry of 

the previous row. 

While row echelon forms are not unique, reduced row echelon forms, whose defi- 
nition we reproduce below, are. 

Definition 4 Reduced Row Echelon Form. An m x n matrix A is in re- 
duced row echelon form if it is in row echelon form, and furthermore, each leading 
entry of a non-zero row is the only non-zero element in its column. 

In order to compute (reduced) row echelon forms, we compute the PLE decom- 
position. 

Definition 5 PLE Decomposition. Let A be anmxn matrix over a field ¥. 

A PLE decomposition of A is a triple of matrices P, L and E such that P is an 
m X m permutation matrix, L is an m x m unit lower triangular matrix, and E is 
anmxn matrix in row- echelon form, and A = PLE. 

We refer the reader to [Jeannerod et al. 2011] for a proof that any matrix has 
a PLE decomposition. There is also proven that the PLE decomposition is at 
least as efficient for almost all applications as other column rank profile revealing 
decompositions in terms of space and time complexity. 

One last technicality stems from the fact that a computer's memory is addressed 
in a one-dimensional fashion, while matrices are two-dimensional objects. The 
methods of this work use row major representations, not column major represen- 
tations. The definition is as follows: 

Definition 6 Row Major Representation. A computer program, stores a 
matrix in row major representation if the entire first row is stored in memory, 
followed by the entire second row, followed by the entire third row, and so forth. 

3. ALGORITHMS 

The PLE decomposition can be computed in sub-cubic time complexity by a block 
recursive algorithm reducing most algebraic operations to matrix multiplication 
and some other much cheaper operations [Jeannerod et al. 2011]. We denote this 
algorithm as "feZocfc recursive PLE decomposition" in this work. In principle this 
recursion could continue until 1x1 matrices are reached as a base case. However, in 
order to achieve optimal performance, these recursive calls must at some dimension 
"cross over" to a base case implementation, presumably much larger than 1x1. 
This base case has asymptotic complexity worse than the calling algorithm but 
better performance, in practice, for small dimensions. 

In this work, we also focus on optimising this base case. This is done in two 
phases: a block iterative algorithm using tabulated updates (based on the "Method 
of four Russians" inversion algorithm [Bard 2006]) denoted by BlockIterative- 
PLE on top of a lazy iterative partial decomposition which we denote PartialPLE. 
Here, lazy has a technical meaning which we define as follows. For some entry A[i, j] 
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j = c 




Fig. 1. Lazy PLE elimination: the matrix has been processed through row r and column c. 
In order to determine if the coefficient at position is non zero, the algorithm must apply 

updates of the two last non zero columns as only so is greater than i and hence the update of its 
corresponding column was already applied. 



below the main diagonal, i.e. with i > j, we do not modify row i in such a way 
as to make j] = until the last possible moment. As it comes to pass, that 
moment will be when we check to see ii A[i,i\ = Q, to see if it would be a suitable 
pivot element. 



3.1 Partial iterative algorithm 

This algorithm is used as a base case of the block iterative algorithm of section 3.2. 
It is based on the classical ©(n^) Gaussian elimination algorithm, where pivots 
are found by searching column-wise: the algorithm only moves to the next column 
when all rows positions of the current column have been found to be zero. This 
condition ensures that the E matrix will be in row echelon form and the column 
indices of the leading entries of each row (i.e. the pivots) will indicate the column 
rank profile of the input matrix. 

To improve cache efficiency, the updates are done in a "lazy" manner, inspired by 
the left looking elimination strategy: updates are delayed to the moment when the 
algorithm actually needs to check whether a coefficient is non-zero. At that moment, 
all trailing coefficients of the current row are also updated, as they are in the same 
cache page. This process requires the maintenance of a table s — [soi • • ■ i Sr-i] 
storing for each pivot column the highest-numbered row considered so far. It is 
defined by Si = max{zj,j = 0...«}, where ij is the row number where the jth 
column's pivot has been found. Consequently, the sequence Si is monotonically 
decreasing. As a result, the algorithm produces the PLE decomposition of only s = 
max{ij, J = . . . rows of the input matrix. We therefore call it a "lazy" iterative 
partial PLE decomposition. The complete algorithm is shown as Algorithm 1 and 
the key idea is illustrated in Figure 1. 

As a specialized instance of standard Gaussian elimination, algorithm 1 computes 
a partial PLE decomposition of an m x n matrix of rank r in 0{mnr) operations 
over F2. For the square case, this is obviously ©(n^r). 
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Algorithm 1: PartialPle: lazy iterative partial PLE Decomposition 
Input: A - an m X n matrix 

Result: Returns r, the rank of A, a permutation vector P of size s, a vector 
Q of size r and an integer s with r < s < m 
'L\E 



Ensures: A ^ 



echelon form, A 



where L is s x r unit lower triangular, £ is an r x n 
PLE = Aq and rank(Ao) = r. 



Ao 



begin 

r,c, 



column i has been updated to row s, . 



0,0; 

// n dim. array, s.t. 

[0,...,0]; 
while r <m and c < n do 

found i — False; 

for c < j < n do // search for some pivot 
for r < i < m do 

if r > and i > s^-i then 

Let fc < r — 1 be such that s^-i > i > s/.; 
for k<l<r do// Update row i from column j to n 
if A[i, Qi] 7^ then add the row I to the row i starting 
at column j; 
si < — i; 



True and break; 



if j] 7^ then found 
if found then break; 
if found then 

swap the rows r and i; 
r,c< — r + l,j + l; 
else 

L J ^ J + 1; 

^max max(5^); 
for < J < r do 

for Sj + 1 < i < Srnax do 

if A[i, Qj] then 
1^ add the row j to the row i starting at column j; 



// Now compress L 

for < j < r do swap the columns j and Qj starting at row j; 
return P, r, Q, s; 
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3.2 Block Iterative 

For matrices over F2 the "Method of the Four Russians" Inversion (M4RI), per- 
forms Gaussian ehmination on a random nx n matrix in 0(n^/logn). The M4RI 
algorithm was introduced in [Bard 2006] and later described in [Bard 2007] and 
[Bard 2009, Ch. 9]. It inherits its name and main idea from "Konrod's Method 
for matrix-matrix multiplication," which is often mistakenly called "The Method 
of Four Russians" despite the fact that none of the authors of that paper are ac- 
tually Russians [Arlazarov et al. 1970; Aho et al. 1974]. For the key idea behind 
the M4RI algorithm, assume that we have computed the row echelon form of an 
k X n submatrix of some matrix A and that this submatrix has rank k. Then, in 
order to eliminate below, we may use a table of all 2'' linear combinations of the 
k pivot rows found in the previous step. Hence, with only one addition per row, k 
columns are processed. Furthermore, the construction of all 2*^ linear combinations 
of the pivots can be accomplished in only 2*^ vector additions saving a factor of k 
compared to naive table construction. Finally, if we set k « log n this algorithm 
can achieve C(n^/logn) under the assumption that it always finds pivots among 
the first c • k rows it considers where c is a small constant. If this condition is not 
met it either fails (see also [Bard 2007]), or regresses to cubic complexity (this is the 
strategy is implemented in the M4RI library). We refer the reader to Appendix A 
for a more in-depth treatment of the M4RI algorithm. 

Our implementation of block-iterative PLE decomposition is based on our im- 
plementation of the M4RI algorithm and the performance reported in Section 5 
is partly due to techniques inherited from our fast implementation of the M4RI 
algorithm. Thus, we point out that our implementation of the M4RI algorithm is 
very efficient, allowing it to beat asymptotically fast implementations of Gaussian 
elimination over F2 up to matrix dimensions of several thousand despite having a 
worse asymptotic complexity. (See also Appendix A.) 

Algorithm 2, which can be seen as the PLE analog of the M4RI algorithm'^, does 
not require pivots among the first c- k rows in order to achieve ©(n^/logn). 

Informally, the algorithm proceeds as follows. We divide the input matrix A into 
vertical stripes of fc « logn columns each. Then, we perform PLE decomposition 
on the first such stripe. This operation is cheap because these stripes are narrow. 
Now, in order to update the matrix to the right of the current stripe we use the 
M4RM matrix multiplication algorithm [Albrecht et al. 2010], i.e., we compute all 
possible linear combinations of the pivot rows just discovered and use the rows of 
the transformation matrix L as a indices in this table. Hence, for each row of A22 
we peform one row addition from the table instead of 0{k) row additions. Thus, 
regardless of the rank of the currently considered stripe, we achieve ©(n^/logn) 
because the update on the right is performed using M4RM. 

Note, however, that our presentation of Algorithm 2 is independent of the choice 
of matrix-multiplication algorithm for clarity of presentation. Thus it could (in 
principle) be instantiated with naive multiplication, and that would result in a 



^Note that [Albrecht 2010] and earher versions of this work constain an algorithm denoted MMPF. 
As it turns out, the MMPF algorithm suffers from the same performance regression as M4R1, in 
the case when pivots cannot be found among the first c • k considered rows. On the other hand, 
unlike M4RI, it does compute a PLE decomposition. 
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complexity of 0[n^). 

However, performing PLE decomposition on the current slice has an important 

disadvantage when compared with the M4RI algorithm: it is less cache friendly. 
Recall, that our matrices are stored in row major representation. PLE decomposi- 
tion will always touch all rows when treating the current slice. This is cheap when 
considering field operations only. However, if the matrix A is sufficient large, this 
will result in a load of a new cache line for each considered row. To mitigate this 
effect, we use PartialPle as defined in the last section. 

Every call to PartialPle returns some r and some s, which arc the dimensions 
of the L matrix calculated by the PartialPle call. Recall, the dimensions of A are 
mxn and k is the parameter of our algorithm. Assume that a call to PartialPle 
has returned with f = k and some s < m and hence that we can decompose our 
matrix as follows 

/ -4oo vloi Am \ 

^ _ AiQ All = Lii \ Ell All 

AlQ A21 = L21 A22 

\ A30 A31 ^32 / 

where An has r rows and A21 has s — r rows. We note that if f < /c we have s = m 
and hence ^31 a matrix with zero rows, i.e., we only need to update A22 which 
reduces to matrix multiplication as discussed above. Hence, we may focus on the 
f = k case in our discussion. 

Algorithm 2 proceeds by first computing A12 {Lii)~^ ■A12. We then construct 
a table T of all 2*^ linear combinations of the rows of Ai2. Using Gray codes or 
similar techniques these tables can be constructed using 2'^ vector additions only 
[Albrecht et al. 2010]. Using this table, we can implement the map v ^ v ■ A12 for 
all 2*^ possible vectors v as the table lookup T{v). Furthmore, we can implement 
the map v ^ v ■ (-Efi^ • A12) as {v ■ E^^) ■ A12. That is, we store a table T' of all 
2'^ possible pairs {v,v ■ -Bf/) and compute v ^ v ■ Eii ■ A12 by the table lookup 

T{r{v)). 

Finally, we "postprocess" T such that we also update L when computing v ■ 
{Eii)~^, i.e., we replace any row v of A31 by v ■ Eii when adding {v ■ E'f/) • A12 
to a row of A32. For example, assume f = 3 and that the first row of the r x n 
submatrix An has the first f bits equal to [1 1] (= 5). Assume further that 
we want to clear f bits of a row which also starts with [1 1] . Then - in order 
to generate L - we need to encode that this row is cleared by adding the first row 
only, i.e., we want the first f = 3 bits to be [10 0]. Thus, to correct the table, we 
add - in the postprocessing phase of T - [1 0] to the first f bits T(5). Hence, 
when adding this row of T to A31, we produce [1 1] 8 [1 1] 8 [1 0] 
= [10 0] and store it in ^31. 

Using T and T', we can now perform the operations 

A22 <r- L21 ■ A12 + A22 in line 14. 

(^31,^32) <r- {A31 ■ {Eii)-\Asi ■ • A12 + A32) in line 15 

using one vector addition per row. The update of A22 is a multiplication by A12 
plus an addition and is hence accomplished by one lookup to T and one vector 
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addition per row. Similarly, the updates of A31 and A32 can be achieved using 
lookups in T' and T and one vector addition per row. 



Algorithm 2: Block- iterative Ple Decomposition 



Input: A - an m X n matrix 

Result: Returns r, the rank of A, a permutation P of size m and an 

r-dimensional vector Q 
Ensures: A ^ [L\ E] where L is unit lower triangular and E is in echelon 

form such that A = PLE and Qi is the column index of the ith 

pivot of E. 

1 begin 

r, c -(- 0,0; 

k i — pick some < k < n; / / 0(n^/logn) — >■ fc w logn 

[0,...,0]; 
while r <m and c < n do 
if k > n — c then 

1^ k <~ n — c; 

^00 ^01 ^02 



/* A -- 



Aqo All A12 



where Aqq ±s r x c and An is {n — r) x k 

*/ 

PartialPle( Ai 1 ) ; 



r,P,Q,s 
for < i < r do 

Qr+i ^ C ~t~ Qi^ 
Pr+i ^r + Pi- 

Am Aoi A02 
^10 Lii \ Ell A12 
A20 L21 I A22 
Azo A31 A32 
f xf , Ell is f X fc 



/* A 



where Aqq is r x c and Ln is 



'Aio 


^P^ 


'AlQ 




'A12 




'A12 


A20_ 




A20 


1 


_A22 




A22_ 



A12 i — (iii)"^ x A12; 
A22 ^ L21 • A12 + A22 ; 

(^31,^32) ^ {A31 ■ E:[^,Azi ■ E^^ ■ A12 + A32) ; 
// Now compress L 
for < « < r do 
1^ swap the columns r + i and c + i starting at row r + i; 

r, c i — r + r,c+ k; 

return P, Q, r; 



Overall we have: 

Lemma 1. Let k = logn, then Algorithm 2 computes the PLE decomposition of 
some n X n matrix A in 0(n^ /logn) field operations if the M^RM algorithm is 
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used in line 14 and 15. 

Proof. Let rj be the rank computed by each call to Algorithm 1. Note that the 
sum of the r^'s is the rank of the matrix, r. 

The overall operation count is obtained the sum of the following contributions: 



-calls to algorithm 1 (line 



i/k-l i-2 



E 



n — \^ rj)kri < nkr, 



i=0 j=0 

— solving with the x lower triangular matrix in on A12 (line 13) can be done 
with the classic cubic time back substitution algorithm: 

n/k — 1 n/k — 1 

rf{n- {i + l)k) < ^ k^n<n^k, 

i=0 i=Q 

— updates based on Gray code tabulated matrix multiplication (line 14): 

n/fe-l i n/k . , , \ 2 

J2in-{^ + l)fc)(2'= +n-J2rj)<i2'+n)J2ik< ^A+J^^ 

i=0 j=0 i=l 

— updates in line 15 share the same table as the previous one, line 14 (all linear 
combinations of rows of A12). Hence they only take: 

n/k—1 2 

J2 (n-(i + l)fc)r,<^. 

For k = \ogn, the overall complexity is 0(n^/logn). □ 
3.3 Block Recursive 

In [Jeannerod et al. 2011] an algorithm is given that computes a PLE decomposition 
in-place and in time complexity 0{n^) by reduction to matrix-matrix multiplica- 
tion. This algorithm can be recovered by setting fc = n/2 in Algorithm 2 and by 
replacing PartialPle with Ple, We give it explicity in Algorithm 3 for clarity 
of presentation. The asymptotic complexity of the algorithm depends on matrix 
multiplication algorithm used. Our implementation uses Strasscn-Winograd and 
hence achieves the 0(n") running time; this is the subject of a previous paper 
[Albrecht et al. 2010]. We note that ^01 ^ — (-^oo)~^ x ^01 is the well-known tri- 
angular solving with matrices (TRSM) operation with a lower triangiilar matrix on 
the left which can be reduced to matrix-matrix multiplication as well [Jeannerod 
et al. 2011]. 

4. IMPLEMENTATION DETAILS 

In our implementation we call Algorithm 3 which recursively calls itself until the 
matrix fits into L2 cache when it calls Algorithm 2. Our implementation of Algo- 
rithm 2 uses four Gray code tables. (See also [Albrecht et al. 2010].) 

One of the major bottlenecks are column swaps. Note that this operation is usu- 
ally not considered when estimating the complexity of algorithms in linear algebra, 
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Algorithm 3: Block-recursive PLE Decomposition 



Input: A - an m X n matrix 

Result: Returns r, the rank of A, a permutation P of size m and an 

r-dimensional vector Q 
Ensures: A ^ [L\E] where L is unit lower triangular and E is in echelon 
form such that A = PLE and Qi is the column index of the ith 
pivot of E. 

begin 

n' < — pick some integer < n' <n; // n' « n/2 



7 




- P' - 






/* A = 


Loo \ Eoo Aqi 
Aio All 


where 






*/ 






8 


if r' then 




9 




Ai^ 


^PxAi; 




10 




Aoi ^ 


— {Loo) ^ X Aoi; 


11 




An^ 


— Aii + Aio X Aoi; 



/* A= ^Aq Aij where Aq is m x n' . 
r',P',Q'i — Ple(^o); // first recursive call 



*/ 



for <i <r' do 

L ft ^ Q'i, 

for < « < r' do 



■^00 



IS 



and Eoo is 



r",P",Q" i — Ple {All); II second recursive call 

Aio ^ P" X Aio; 
I* Update P & Q 
for < i < m — r' do 

for < I < n — n' do 

[_ Q„'+^ ^ Q'l + n'; 



|_ Qj i — Qi; 3^3 + 1; 

/* Now compress L 

3 ^ n'; 

for r' <i <r' + r" do 
1^ swap the columns i and j starting at row i; 

return P,Q,r' + r"; 



*/ 



*/ 
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since it is not a field operation. In Algorithm 4 a simple algorithm for swapping 
two columns a and h is given with bit-level detail. In Algorithm 4 we assume that 
the bit position of a is greater than the bit position of h for clarity of presentation. 
The advantage of the strategy in Algorithm 4 is that it uses no conditional jumps 
in the inner loop, However, it still requires 9 instructions per row. On the other 
hand, we can add two rows with 9 • 128 = 1152 entries in 9 instructions if the 
"SSE2 instruction set" is available. Thus, for matrices of size 1152 x 1152 it takes 
roughly the same number of instructions to add two matrices as it does to swap two 
columns. If we were to swap every column with one other column once during some 
algorithm it thus would be as expensive as a matrix multiplication (for matrices of 
these dimensions). 



Algorithm 4: Column Swap 
Input: ^ - an TO X n matrix 
Input: a - an integer < a < 6 < n 
Input: 6 - an integer < a < 6 < n 
Result: Swaps the columns a and h'm A 



Another bottleneck for relatively sparse matrices in dense "row-major represen- 
tation" is the search for pivots. Searching for a non-zero element in a row can be 
relatively expensive due to the need to identify the bit position; hence, we cannot 
work with machine words in bulk but have to read individual bits and see which 
column they are in. However, the main performance penalty is due to the fact that 
searching for a non-zero entry in one column in a row-major representation is very 
cache unfriendly. 

Indeed, both our implementation and the; implementation available in Magma 
suffer from performance regression on relatively sparse matrices as shown in Fig- 
ure 2. We stress that this is despite the fact that the theoretical complexity of ma- 
trix decomposition is rank sensitive. More precisely, strictly fewer field operations 
must be performed for low-rank matrices. This highlights the distinction between 
minimizing the running time and minimizing the number of field operations — ^these 
are not identical objectives. While the penalty for relatively sparse matrices is 
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much smaller for our implementation than for Magma, it clearly does not achieve 
the theoretically optimum performance. 



.S 4 




\AY\y^ Magma 



■ M4RI 
PLE 



~i 1 1 r~ 

6 10 14 18 

non-zero elements per row 



Fig. 2. Gaussian elimination of 10, 000 x 10, 000 matrices on Intel 2.33GHz Xeon E5345 comparing 
Magma 2.17-12 and M4RI 20111004. 



5. PERFORMANCE 

In Table I, we consider reduced row echelon forms, i.e., we first compute the PLE 

decomposition which we then use to recover the reduced row echelon form, of dense 
random matrices over F2. We compare the running times for Algorithm 3 (with 
either cubic PLE decomposition or Algorithm 2 as base case). 



Matrix Dimension 


base case: cubic PLE 


base case: Algorithm 2 


10, 000 X 10, 000 


1.478s 


0.880s 


16,384 X 16,384 


5.756s 


3.570s 


20, 000 X 20, 000 


8.712s 


5.678s 


32, 000 X 32, 000 


29.705s 


24.240s 



Table I. Comparing base cases on 64-bit Debian/GNU Linux, 2.33Ghz Core 2 Duo. 



In Table II, we give the average running time over ten trials for computing re- 
duced row echelon forms of dense random n x n matrices over F2. We compare 
the asymptotically fast implementation due to Allan Steel in Magma, the cubic 
Gaussian elimination implemented by Victor Shoup in NTL, and both our im- 
plementations. Both the implementation in Magma and our PLE decomposition 
reduce matrix decomposition to matrix multiplication. A discussion and compari- 
son of matrix multiplication in the M4RI library and in Magma can be found in 
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[Albrecht et al. 2010]. In Table II, the column 'PLE' denotes the complete run- 
ning time for first computing the PLE decomposition and the computation of the 
reduced row echelon form from the PLE form. 





64-bit Linux, 2.6Ghz Opteron 


n 


Magma 


NTL 


M4RI 


PLE 




2.15-10 


5.4.2 


20090105 


20100324 


10, 000 


3.351s 


18.45s 


2.430s 


1.452s 


16,384 


11.289s 


72.89s 


10.822s 


6.920s 


20, 000 


16.734s 


130.46s 


19.978s 


10.809s 


32, 000 


57.567s 


479.07s 


83.575s 


49.487s 


64, 000 


373.906s 


2747.41s 


537.900s 


273.120s 




64-bit Linux, 2.33Ghz Xeon (E5345) 


n 


Magma 


NTL 


M4RI 


PLE 




2.16-7 


5.4.2 


20100324 


20100324 


10, 000 


2.660s 


12.05s 


1.360s 


0.864s 


16, 384 


8.617s 


54.79s 


5.734s 


3.388s 


20, 000 


12.527s 


100.01s 


10.610s 


5.661s 


32, 000 


41.770s 


382.52s 


43.042s 


20.967s 


64, 000 


250.193s 




382.263s 


151.314s 



Table IL Gaussian elimination for random matrices n x n matrices 
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A. THE M4RI ALGORITHM 

Consider the matrix A of dimension m x n in Figure 3. Because A; = 3 here, the 
3 X n submatrix on the top has full rank and we performed Gaussian elimination on 
it. Now, we need to clear the first k columns of A in the rows below k (and above 
the submatrix in general if we want the reduced row echelon form). There are 2^^ 
possible linear combinations of the first k rows, which we store in a table T. We 
index T by the first k bits (e.g. Oil — > 3). Now to clear k columns of row i we use 
the first k bits of that row as an index in T and add the matching row of T to row i, 
causing a cancellation of k entries. Instead of up to k additions, this only costs one 
addition, due to the pre-computation. Using Gray codes (see also [Albrecht et al. 
2010]) or similar techniques this pre-computation can be performed in 2* vector 
additions and the overall cost is 2*^ + m — A; + fc^ vector additions in the worst case 
(where k"^ accounts for the Gauss elimination of the fc x n submatrix). The naive 
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approach would cost k ■ m row additions in the worst case to clear k columns. If we 
set k = logm then the complexity of clearing k columns is 0[m + log^ m) vector 
additions in contrast to 0{m- logm) vector additions using the naive approach. 
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Fig. 3. M4RI Idea 



This idea leads to Algorithm 8. In this algorithm the subroutine GaussSubmatrix 
(see Algorithm 7) performs Gauss elimination on a A; x n submatrix of A starting 
at position (r, c) and searches for pivot rows up to m. If it cannot find a submatrix 
of rank k it will terminate and return the rank k found so far. 

The subroutine MakeTable (see Algorithm 6) constructs the table T of all 2*^ 
linear combinations of the k rows starting at row r and column c. More precisely, it 
enumerates all elements of the vector space span(r, ...,r + k — l) spanned by the rows 
r, . . . ,r + k - 1. Finally, the subroutine AddRowsFromTable (see Algorithm 5) 
adds the appropriate row from T indexed by k bits starting at column c — to 
each row of A with index i ^ {r, . . . , r + fc — 1}. That is, it adds the appropriate 
linear combination of the rows {r, . . . ,r + k — 1} onto a row i in order to clear k 
columns. 



Algorithm 5: AddRowsFromTable 



Input: ^ - an m X n matrix 
Input: r start - an integer < r^tart < m 
Input: rend ^ an integer < rgtart < T'cnd < iri 
Input: Cstart an integer < Cstart < n 
Input: fc - an integer A: > 
Input: T - a 2*^ X n matrix 
Input: L - an integer array of length 2*^ 
1 begin 

for rstart < i < rend do 



j < — Lid; 

add row j from T to the row i of A starting at column Cgtart; 
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Algorithm 6: MakeTable 



A - an m X n matrix 
?"start - an integer < r^tart < "^ 
Cstart - an integer < Cstart < n 
fc - an integer A: > 



Input 
Input 
Input 
Input 

Result: Retuns an 2*^ x n matrix T 

1 begin 

2 T i — the 2^ X n zero matrix; 

3 for 1 < z < 2*^ do 

4 i i — the row index of A to add according to the Gray code; 

5 add row j of A to the row i oi T starting at column Cgtart! 

6 L i — integer array allowing to index T by fc bits starting at column Cgtart! 

7 return T, L; 



Algorithm 7: GaussSubmatrix 



Input: A - an m X n matrix 
Input: r - an integer < r < m 
Input: c- an integer < c < n 
Input: fc - an integer fc > 

Input: Tend an integer < r < rend < fn 

Result: Returns the rank fc < fc and puts the fc x (n — c) submatrix starting 
at A[r, c] in reduced row echelon form. 

begin 

^start 

for c < j < c + fc do 
found i — False; 
for rstart < i < rend do 

for < Z < J — c do // clear the first columns 

if c + 7^ then add row r + I to row i of ^ starting at 
column c + I; 

if A[i,j] ^ then // pivot? 

Swap the rows i and r start in A; 
for r < I < rstart do // clear above 

if A[l,j] 7^ then add row rstart to row I in A starting at 
column j; 

r start ^ r start ~t~ Ij 

found i — True; 
break; 



if found =False then 
1^ return j - c; 

return j - c; 
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Algorithm 8: M4RI 



Input: A - an m X n matrix 
Input: k an integer k > 
Result: A is in reduced row echelon form. 
1 begin 

r,ci — 0, 0; 
while c < n do 

if c + k > n then k n — c; 
ki — GaussSubmatrix(^, r,c,fc,m); 
if fc > then 

T, L i — MakeTable(A, r, c, k); 
AddRowsFromTable(A, 0, r, c, k, T, L) ; 
AddRowsFromTable(^, r + fc, m, c, fc, T, L); 

r, c i — r + k, c + k; 
if fc ^ fc then c c + 1; 



When studying the performance of Algorithm 8, we expect the hmction Make- 
Table to contribute most. Instead of performing k/2 ■ 2*^ — 1 additions Make- 
Table only performs 2*^ — 1 vector additions. However, in practice the fact that k 
columns are processed in each loop iteration of AddRowsFromTable contributes 
signficiantly due to the better cache locality. Assume the input matrix A does not 
fit into L2 cache. Gaussian elimination would load a row from memory, clear one 
column and likely evict that row from cache in ordcir to make room for the next 
few rows before considering it again for the next column. In the M4RI algorithm 
more columns are cleared per memory load instruction. 

We note that our presentation of M4RI differs somewhat from that in [Bard 
2007]. The key difference is that our variant does not throw an error if it cannot 
find a pivot within the first 3k rows in GaussSubmatrix. Instead, our variant 
searches all rows and consequently the worst-case complexity is cubic. However, on 
average for random matrices we expect to find a pivot within 3A; rows with over- 
whelming probability [Bard 2009, Ch. 9.5.3] and thus the average-case complexity 
is O(n^/logn). 

In the Table HI, we give running times for computing the reduced row echelon 
form (or row echelon form for NTL) for random matrices of various dimensions. 
The column 'M4Rr in both tables refers to our implementation of Algorithm 8 
using implementation techniques from [Albrecht et al. 2010] such as multiple pre- 
computation tables. The implementation that NTL uses is straight-forward Gaus- 
sian elimination and thus serves as a baseline for comparison while Magma imple- 
ments asymptotically fast matrix elimination. 

Table HI shows that our implementation of the M4RI algorithm (with average- 
case complexity 0(n^ / lognj) is competitive with Magma's asymptotically fast 
algorithm (with complexity C(n")) up to 64,000 x 64,000 on the Core 2 Duo and 
up to 16, 384 X 16, 384 on the Opteron. 
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64-bit Debian/GNU Linux, 2.33Ghz Core 2 Duo 


Matrix 


Magma 


NTL 


M4RI 


Dimensions 


2.14-17 


5.4.2 


20090105 


10,000 X 10,000 


2.474s 


14.140s 


1.532s 


16, 384 X 16, 384 


7.597s 


67.520s 


6.597s 


20, 000 X 20, 000 


13.151s 


123.700s 


12.031s 


32,000 X 32,000 


39.653s 


462.320s 


40.768s 


64, 000 X 64, 000 


251.346s 


2511.540s 


241.017s 


64-bit Debian/GNU Linux, 2.6Ghz Opteron 


Matrix 


Magma 


NTL 


M4RI 


Dimensions 


2.14-13 


5.4.2 


20090105 


10,000 X 10,000 


3.351s 


18.45s 


2.430s 


16, 384 X 16, 384 


11.289s 


72.89s 


10.822s 


20, 000 X 20, 000 


16.734s 


130.46s 


19.978s 


32, 000 X 32, 000 


57.567s 


479.07s 


83.575s 


64, 000 X 64, 000 


373.906s 


2747.41s 


537.900s 



Table IIL RREF for Random Matrices 
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